    Proactive Interference-aware Resource Management in Deep Learning Training Cluster

    Deep Learning (DL) applications are growing at an unprecedented rate across many domains, ranging from weather prediction and map navigation to medical imaging. However, training these deep learning models in large-scale compute clusters faces substantial challenges in terms of low cluster resource utilisation and high job waiting times. State-of-the-art DL cluster resource managers are needed to increase GPU utilisation and maximise throughput. While co-locating DL jobs within the same GPU has been shown to be an effective means of achieving this, co-location incurs performance interference that results in job slowdown. We argue that effective workload placement can minimise DL cluster interference at scheduling runtime by understanding DL workload characteristics and their respective hardware resource consumption. However, existing DL cluster resource managers reserve isolated GPUs to perform online profiling, directly measuring GPU utilisation and kernel patterns for each unique submitted job. Such a feedback-based, reactive approach results in additional waiting time as well as reduced cluster resource efficiency and availability. In this thesis, we propose Horus: an interference-aware and prediction-based DL cluster resource manager. Through an empirical study of a series of microbenchmarks and DL workload co-location combinations across heterogeneous GPU hardware, we demonstrate the negative effects of performance interference when co-locating DL workloads, and identify GPU utilisation as a general proxy metric for determining good placement decisions. From these findings, we design Horus, which, in contrast to existing approaches, proactively predicts the GPU utilisation of heterogeneous DL workloads from DL model computation graph features when making placement decisions, removing the need for online profiling and isolated reserved GPUs. By conducting empirical experimentation within a medium-scale DL cluster as well as a large-scale trace-driven simulation of a production system, we demonstrate that Horus improves cluster GPU utilisation, reduces cluster makespan and waiting time, and can scale to operate across hundreds of machines.
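    A minimal sketch of the placement intuition above, assuming a per-GPU utilisation budget and a utilisation value already predicted for the incoming job; the names place_job and UTIL_CAP are illustrative and not part of Horus itself:

        from typing import Dict, List, Optional

        UTIL_CAP = 100.0  # assumed per-GPU utilisation budget (percent)

        def place_job(predicted_util: float,
                      gpu_loads: Dict[str, List[float]]) -> Optional[str]:
            """Pick the GPU with the lowest projected co-located utilisation,
            rejecting any GPU that would exceed the budget."""
            best_gpu, best_load = None, float("inf")
            for gpu_id, colocated in gpu_loads.items():
                projected = sum(colocated) + predicted_util
                if projected <= UTIL_CAP and projected < best_load:
                    best_gpu, best_load = gpu_id, projected
            return best_gpu  # None -> queue the job rather than co-locate it

        # Example: a job predicted at 35% utilisation, three candidate GPUs
        gpus = {"gpu-0": [60.0, 20.0], "gpu-1": [45.0], "gpu-2": [90.0]}
        print(place_job(35.0, gpus))  # -> "gpu-1"

    The sketch reflects the thesis's core intuition that aggregate GPU utilisation acts as a proxy for interference, so co-location is avoided whenever the projected utilisation exceeds the budget.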

    Towards GPU Utilization Prediction for Cloud Deep Learning

    Understanding the GPU utilization of Deep Learning (DL) workloads is important for enhancing resource efficiency and cost-benefit decision making for DL frameworks in the cloud. Current approaches to determining DL workload GPU utilization rely on online profiling within isolated GPU devices, which must be performed for every unique DL workload submission, resulting in resource under-utilization and reduced service availability. In this paper, we propose a prediction engine to proactively determine the GPU utilization of heterogeneous DL workloads without the need for in-depth or isolated online profiling. We demonstrate that it is possible to predict a DL workload's GPU utilization by extracting information from its model computation graph. Our experiments show that the prediction engine achieves an RMSLE of 0.154, and can be exploited by DL schedulers to achieve up to a 61.5% improvement in GPU cluster utilization.
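    As a rough illustration of the approach, the sketch below predicts GPU utilization from a handful of computation-graph features using an off-the-shelf regressor and reports RMSLE; the feature set, model choice, and numbers are assumptions for illustration, not the paper's actual design or data:

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.metrics import mean_squared_log_error

        # Each row: [conv ops, matmul ops, parameters (millions), batch size]
        X_train = np.array([[20, 5, 25.6, 32],
                            [50, 10, 60.2, 64],
                            [5, 30, 110.0, 16],
                            [35, 8, 44.5, 128]])
        y_train = np.array([40.0, 75.0, 55.0, 90.0])  # measured GPU utilization (%)

        model = GradientBoostingRegressor().fit(X_train, y_train)

        X_test = np.array([[25, 6, 30.0, 32]])
        y_test = np.array([48.0])
        pred = model.predict(X_test)

        rmsle = np.sqrt(mean_squared_log_error(y_test, pred))
        print(f"predicted utilization: {pred[0]:.1f}%, RMSLE: {rmsle:.3f}")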

    An Empirical Study of Inter-cluster Resource Orchestration within Federated Cloud Clusters

    Federated clusters are composed of multiple independent clusters of machines interconnected by a resource management system, and possess several advantages over centralized cloud datacenter clusters, including seamless provisioning of applications across large geographic regions, greater fault tolerance, and increased cluster resource utilization. However, while existing resource management systems for federated clusters are capable of improving application intra-cluster performance, they do not capture inter-cluster performance in their decision making. This is important given that federated clusters must execute a wide variety of applications possessing heterogeneous system architectures, which are impacted by unique inter-cluster performance conditions such as network latency and localized cluster resource contention. In this work we present an empirical study demonstrating how inter-cluster performance conditions negatively impact federated cluster orchestration systems. We conduct a series of micro-benchmarks under various cluster operational scenarios, showing the critical importance of capturing inter-cluster performance for resource orchestration in federated clusters. From this benchmark, we determine precise limitations in existing federated orchestration, and highlight key insights for designing future orchestration systems. Notable findings include different application types exhibiting innate performance affinities across various federated cluster operational conditions, and experiencing substantial performance degradation from even minor increases in latency (8.7x) and resource contention (12.0x) in comparison to centralized cluster architectures.
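    For clarity, the slowdown factors above are ratios of application completion time under federated conditions to a centralized baseline; a minimal sketch with illustrative numbers (not the study's measurements):

        def slowdown(federated_seconds: float, centralized_seconds: float) -> float:
            """Ratio of federated to centralized completion time (1.0 = no penalty)."""
            return federated_seconds / centralized_seconds

        # Hypothetical measurements for one application type
        print(f"{slowdown(870.0, 100.0):.1f}x")   # latency-induced slowdown -> 8.7x
        print(f"{slowdown(1200.0, 100.0):.1f}x")  # contention-induced slowdown -> 12.0x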

    Trimmer: Cost-Efficient Deep Learning Auto-tuning for Cloud Datacenters

    Provisioning high-performance Machine Learning-as-a-Service (MLaaS) at reduced resource cost in Cloud datacenters is achieved via auto-tuning: the automated tensor program optimization of Deep Learning models to minimize inference latency on a hardware device. However, given the extensive heterogeneity of Deep Learning models, libraries, and hardware devices, performing auto-tuning within Cloud datacenters incurs significant time, compute resource, and energy costs, which state-of-the-art auto-tuning is not designed to mitigate. In this paper we propose Trimmer, a high-performance and cost-efficient Deep Learning auto-tuning framework for Cloud datacenters. Trimmer maximizes DL model performance and tensor program cost-efficiency by preempting tensor program implementations that exhibit poor optimization improvement, and by applying an ML-based filtering method that replaces expensive, low-performing tensor programs to increase the likelihood of selecting low-latency tensor programs. Through an empirical study exploring the cost of DL model optimization techniques, our analysis indicates that 26-43% of total energy is expended on measuring tensor program implementations that do not positively contribute towards auto-tuning. Experiment results show that Trimmer achieves high auto-tuning cost-efficiency across different DL models, and reduces auto-tuning energy use by 21.8-40.9% for Cloud clusters whilst achieving DL model latency equivalent to state-of-the-art techniques.
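    The sketch below illustrates the two cost-saving ideas named in the abstract, preempting candidates once improvement stalls and filtering out candidates a lightweight model predicts to be slow; the measure and predict_latency callables, thresholds, and loop structure are illustrative assumptions rather than Trimmer's actual algorithm:

        from typing import Callable, Iterable, Optional

        def tune(candidates: Iterable[dict],
                 measure: Callable[[dict], float],
                 predict_latency: Callable[[dict], float],
                 filter_cutoff_ms: float = 5.0,
                 patience: int = 3) -> Optional[dict]:
            """Return the lowest-latency candidate measured before preemption."""
            best, best_ms, stalled = None, float("inf"), 0
            for cand in candidates:
                # ML-based filtering: skip candidates predicted to be slow.
                if predict_latency(cand) > filter_cutoff_ms:
                    continue
                ms = measure(cand)  # expensive on-device measurement
                if ms < best_ms:
                    best, best_ms, stalled = cand, ms, 0
                else:
                    stalled += 1
                    if stalled >= patience:  # preempt: improvement has stalled
                        break
            return best

        # Toy usage with a synthetic latency function standing in for real measurement
        cands = [{"tile": t} for t in range(1, 20)]
        fake_latency = lambda c: 4.0 + abs(c["tile"] - 8) * 0.1
        print(tune(cands, measure=fake_latency, predict_latency=fake_latency))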

    DOPpler: Parallel Measurement Infrastructure for Auto-tuning Deep Learning Tensor Programs

    The heterogeneity of Deep Learning models, libraries, and hardware poses an important challenge for improving model inference performance. Auto-tuners address this challenge via automatic tensor program optimization towards a target-device. However, auto-tuners incur a substantial time cost to complete because their design necessitates measuring tensor program candidates serially within an isolated target-device to minimize latency measurement inaccuracy. In this paper we propose DOPpler, a parallel auto-tuning measurement infrastructure. DOPpler allows for considerable auto-tuning speedup over conventional approaches whilst maintaining high-quality tensor program optimization. DOPpler accelerates the auto-tuning process with a parallel execution engine that efficiently executes candidate tensor programs in parallel across the CPU-host and GPU target-device, and overcomes measurement inaccuracy by introducing a high-precision on-device technique for measuring tensor program kernel latency. DOPpler automatically calculates the optimal degree of parallelism to provision fast and accurate auto-tuning for different tensor programs, auto-tuners and target-devices. Experiment results show that DOPpler reduces total auto-tuning time by 50.5% on average whilst achieving optimization gains equivalent to conventional auto-tuning infrastructure.
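    A minimal sketch of the pipelining intuition, assuming hypothetical compile_fn and measure_fn stand-ins: candidate preparation is spread across CPU workers while measurements stay serial on the target device; DOPpler's actual parallelism calculation and on-device timing technique are not shown:

        import queue
        import threading
        from concurrent.futures import ThreadPoolExecutor, as_completed

        def run_pipeline(candidates, compile_fn, measure_fn, cpu_workers=4):
            """Overlap CPU-side compilation with device-side measurement."""
            ready = queue.Queue()
            results = {}

            def measurer():
                # Measurements remain serial on the device to keep latency readings accurate.
                while True:
                    item = ready.get()
                    if item is None:
                        break
                    idx, binary = item
                    results[idx] = measure_fn(binary)

            t = threading.Thread(target=measurer)
            t.start()
            with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
                futures = {pool.submit(compile_fn, c): i for i, c in enumerate(candidates)}
                for fut in as_completed(futures):
                    ready.put((futures[fut], fut.result()))
            ready.put(None)
            t.join()
            return results

        # Toy usage with stand-in compile/measure functions
        print(run_pipeline(range(4),
                           compile_fn=lambda c: f"binary-{c}",
                           measure_fn=lambda b: len(b) * 0.1))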